Problem Statement

Business Context

A sales forecast is a prediction of future sales revenue based on historical data, industry trends, and the status of the current sales pipeline. Businesses use the sales forecast to estimate weekly, monthly, quarterly, and annual sales totals. A company needs to make an accurate sales forecast as it adds value across an organization and helps the different verticals to chalk out their future course of action.

Forecasting helps an organization plan its sales operations by region and provides valuable insights to the supply chain team regarding the procurement of goods and materials. An accurate sales forecast process has many benefits which include improved decision-making about the future and reduction of sales pipeline and forecast risks. Moreover, it helps to reduce the time spent in planning territory coverage and establish benchmarks that can be used to assess trends in the future.

Objective

SuperKart is a retail chain operating supermarkets and food marts across various tier cities, offering a wide range of products. To optimize its inventory management and make informed decisions around regional sales strategies, SuperKart wants to accurately forecast the sales revenue of its outlets for the upcoming quarter.

To operationalize these insights at scale, the company has partnered with a data science firm—not just to build a predictive model based on historical sales data, but to develop and deploy a robust forecasting solution that can be integrated into SuperKart’s decision-making systems and used across its network of stores.

Data Description

The data contains the different attributes of the various products and stores.The detailed data dictionary is given below.

  • Product_Id - unique identifier of each product, each identifier having two letters at the beginning followed by a number.
  • Product_Weight - weight of each product
  • Product_Sugar_Content - sugar content of each product like low sugar, regular and no sugar
  • Product_Allocated_Area - ratio of the allocated display area of each product to the total display area of all the products in a store
  • Product_Type - broad category for each product like meat, snack foods, hard drinks, dairy, canned, soft drinks, health and hygiene, baking goods, bread, breakfast, frozen foods, fruits and vegetables, household, seafood, starchy foods, others
  • Product_MRP - maximum retail price of each product
  • Store_Id - unique identifier of each store
  • Store_Establishment_Year - year in which the store was established
  • Store_Size - size of the store depending on sq. feet like high, medium and low
  • Store_Location_City_Type - type of city in which the store is located like Tier 1, Tier 2 and Tier 3. Tier 1 consists of cities where the standard of living is comparatively higher than its Tier 2 and Tier 3 counterparts.
  • Store_Type - type of store depending on the products that are being sold there like Departmental Store, Supermarket Type 1, Supermarket Type 2 and Food Mart
  • Product_Store_Sales_Total - total revenue generated by the sale of that particular product in that particular store

Installing and Importing the necessary libraries

In [10]:
#Installing the libraries with the specified versions
!pip install numpy==2.0.2 pandas==2.2.2 scikit-learn==1.6.1 matplotlib==3.10.0 seaborn==0.13.2 joblib==1.4.2 xgboost==2.1.4 requests==2.32.3 huggingface_hub==0.30.1 streamlit==1.43.2 shap -q

Note:

  • After running the above cell, kindly restart the notebook kernel (for Jupyter Notebook) or runtime (for Google Colab) and run all cells sequentially from the next cell.

  • On executing the above line of code, you might see a warning regarding package dependencies. This error message can be ignored as the above code ensures that all necessary libraries and their dependencies are maintained to successfully execute the code in this notebook.

In [11]:
import warnings
warnings.filterwarnings("ignore")

import streamlit as st

# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# For splitting the dataset
from sklearn.model_selection import train_test_split

# Libaries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Lib for displaying the updated dataset
from IPython.display import display
import torch
print("GPU Available:", torch.cuda.is_available())



# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 100)


# Libraries different ensemble classifiers
from sklearn.ensemble import (
    BaggingRegressor,
    RandomForestRegressor,
    AdaBoostRegressor,
    GradientBoostingRegressor,
)
from xgboost import XGBRegressor
from sklearn.tree import DecisionTreeRegressor

from sklearn import metrics

# Libraries to get different metric scores
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    mean_squared_error,
    mean_absolute_error,
    r2_score,
    mean_absolute_percentage_error
)

# To create the pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler,OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline,Pipeline

# To tune different models and standardize
from sklearn.model_selection import GridSearchCV

# To serialize the model
import joblib

# os related functionalities
import os

# API request
import requests

# for hugging face space authentication to upload files
from huggingface_hub import login, HfApi, create_repo

import shap
GPU Available: True

Loading the dataset

In [12]:
# import from google drive
from google.colab import drive

# import secrets from google drive
from google.colab import userdata

drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Data Overview

In [14]:
data_ = pd.read_csv('/content/drive/MyDrive/SuperKart.csv')
batch_test = pd.read_csv('/content/drive/MyDrive/SuperKart_Batch_.csv')

Make a copy

In [15]:
# Elementary-level tuning
data = data_.copy()

# Adavanced tuning (binning + dummies)
xgb_data = data.copy()

Lets review what we are working with

In [16]:
display(data.nunique())
0
Product_Id 8763
Product_Weight 1113
Product_Sugar_Content 4
Product_Allocated_Area 228
Product_Type 16
Product_MRP 6100
Store_Id 4
Store_Establishment_Year 4
Store_Size 3
Store_Location_City_Type 3
Store_Type 4
Product_Store_Sales_Total 8668

Drop Product_Id - Insignificant as this is exacly unique per entry. No value in keeping it.

In [17]:
data.drop('Product_Id', axis=1, inplace=True)
xgb_data.drop('Product_Id', axis=1, inplace=True)

Select categorical and numerical columns programmatically

In [18]:
# select numerical columns from dataset
numerical_cols = [feature for feature in data.columns if data[feature].dtypes != 'O']

# select categorical columns from dataset
categorical_cols = [feature for feature in data.columns if data[feature].dtypes == 'O']

# lets also demarcate XGB copy DF
xgb_numerical_cols = [feature for feature in xgb_data.columns if xgb_data[feature].dtypes != 'O']

xgb_categorical_cols = [feature for feature in xgb_data.columns if xgb_data[feature].dtypes == 'O']
In [19]:
data.shape
Out[19]:
(8763, 11)

Unique value counts for dataset

In [20]:
display(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8763 entries, 0 to 8762
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Product_Weight             8763 non-null   float64
 1   Product_Sugar_Content      8763 non-null   object 
 2   Product_Allocated_Area     8763 non-null   float64
 3   Product_Type               8763 non-null   object 
 4   Product_MRP                8763 non-null   float64
 5   Store_Id                   8763 non-null   object 
 6   Store_Establishment_Year   8763 non-null   int64  
 7   Store_Size                 8763 non-null   object 
 8   Store_Location_City_Type   8763 non-null   object 
 9   Store_Type                 8763 non-null   object 
 10  Product_Store_Sales_Total  8763 non-null   float64
dtypes: float64(4), int64(1), object(6)
memory usage: 753.2+ KB
None
  • 11 columns - after dropping 'Product_Id'
  • 8,763 rows

Duplicated input review

In [21]:
duplicate_percentage = data.duplicated().mean() * 100
print(f"Percentage of duplicate rows: {duplicate_percentage:.2f}%")
Percentage of duplicate rows: 0.00%

Empty values

In [22]:
display(round(data.isnull().sum() / data.isnull().count() * 100, 2))
0
Product_Weight 0.0
Product_Sugar_Content 0.0
Product_Allocated_Area 0.0
Product_Type 0.0
Product_MRP 0.0
Store_Id 0.0
Store_Establishment_Year 0.0
Store_Size 0.0
Store_Location_City_Type 0.0
Store_Type 0.0
Product_Store_Sales_Total 0.0

Numerical columns overview

In [23]:
display(data.describe(include='number').T)
count mean std min 25% 50% 75% max
Product_Weight 8763.0 12.653792 2.217320 4.000 11.150 12.660 14.180 22.000
Product_Allocated_Area 8763.0 0.068786 0.048204 0.004 0.031 0.056 0.096 0.298
Product_MRP 8763.0 147.032539 30.694110 31.000 126.160 146.740 167.585 266.000
Store_Establishment_Year 8763.0 2002.032751 8.388381 1987.000 1998.000 2009.000 2009.000 2009.000
Product_Store_Sales_Total 8763.0 3464.003640 1065.630494 33.000 2761.715 3452.340 4145.165 8000.000

Categorical columns overview

In [24]:
# lets check the proportion of the mode against total value counts
df = data.describe(include='object').T
df['mode_proportion'] = df['freq'] / df['count']
display(df)
count unique top freq mode_proportion
Product_Sugar_Content 8763 4 Low Sugar 4885 0.557457
Product_Type 8763 16 Fruits and Vegetables 1249 0.142531
Store_Id 8763 4 OUT004 4676 0.533607
Store_Size 8763 3 Medium 6025 0.68755
Store_Location_City_Type 8763 3 Tier 2 6262 0.714595
Store_Type 8763 4 Supermarket Type2 4676 0.533607

Proportions overview

  • Highly skewed proportions (uneven distributions).
  • Can introduce sparsity and noise into tree-based models (especially if some categories are rarely seen during training).
Column Mode Proportion
Product_Sugar_Content Low Sugar 55.7%
Product_Type Fruits and Vegetables 14.3%
Store_Id OUT004 53.3%
Store_Size Medium 68.8%
Store_Location_City_Type Tier 2 71.5%
Store_Type Supermarket Type2 53.4%
In [25]:
display(data.head(25))
Product_Weight Product_Sugar_Content Product_Allocated_Area Product_Type Product_MRP Store_Id Store_Establishment_Year Store_Size Store_Location_City_Type Store_Type Product_Store_Sales_Total
0 12.66 Low Sugar 0.027 Frozen Foods 117.08 OUT004 2009 Medium Tier 2 Supermarket Type2 2842.40
1 16.54 Low Sugar 0.144 Dairy 171.43 OUT003 1999 Medium Tier 1 Departmental Store 4830.02
2 14.28 Regular 0.031 Canned 162.08 OUT001 1987 High Tier 2 Supermarket Type1 4130.16
3 12.10 Low Sugar 0.112 Baking Goods 186.31 OUT001 1987 High Tier 2 Supermarket Type1 4132.18
4 9.57 No Sugar 0.010 Health and Hygiene 123.67 OUT002 1998 Small Tier 3 Food Mart 2279.36
5 12.03 Low Sugar 0.053 Snack Foods 113.64 OUT004 2009 Medium Tier 2 Supermarket Type2 2629.15
6 16.35 Low Sugar 0.112 Meat 185.71 OUT003 1999 Medium Tier 1 Departmental Store 5081.14
7 12.94 No Sugar 0.286 Household 194.75 OUT003 1999 Medium Tier 1 Departmental Store 4494.62
8 9.45 Low Sugar 0.047 Snack Foods 95.95 OUT002 1998 Small Tier 3 Food Mart 1684.82
9 8.94 No Sugar 0.045 Health and Hygiene 143.01 OUT004 2009 Medium Tier 2 Supermarket Type2 2531.30
10 10.64 Low Sugar 0.020 Hard Drinks 165.99 OUT004 2009 Medium Tier 2 Supermarket Type2 3385.46
11 13.92 Low Sugar 0.099 Fruits and Vegetables 116.89 OUT004 2009 Medium Tier 2 Supermarket Type2 3123.19
12 10.97 Low Sugar 0.156 Fruits and Vegetables 175.71 OUT004 2009 Medium Tier 2 Supermarket Type2 3660.82
13 12.25 Low Sugar 0.040 Canned 160.42 OUT004 2009 Medium Tier 2 Supermarket Type2 3635.50
14 11.31 Regular 0.049 Baking Goods 168.40 OUT004 2009 Medium Tier 2 Supermarket Type2 3587.27
15 13.26 Low Sugar 0.024 Breads 156.17 OUT004 2009 Medium Tier 2 Supermarket Type2 3776.76
16 10.74 No Sugar 0.173 Household 138.19 OUT004 2009 Medium Tier 2 Supermarket Type2 2839.53
17 12.45 Low Sugar 0.045 Frozen Foods 143.25 OUT004 2009 Medium Tier 2 Supermarket Type2 3328.99
18 13.17 Low Sugar 0.075 Fruits and Vegetables 185.03 OUT003 1999 Medium Tier 1 Departmental Store 4347.91
19 12.67 No Sugar 0.038 Health and Hygiene 150.51 OUT004 2009 Medium Tier 2 Supermarket Type2 3529.29
20 10.57 Regular 0.110 Fruits and Vegetables 143.36 OUT004 2009 Medium Tier 2 Supermarket Type2 2907.64
21 16.75 No Sugar 0.043 Health and Hygiene 236.09 OUT003 1999 Medium Tier 1 Departmental Store 6643.63
22 14.47 Regular 0.010 Fruits and Vegetables 145.65 OUT004 2009 Medium Tier 2 Supermarket Type2 3835.95
23 11.67 Low Sugar 0.118 Canned 179.54 OUT004 2009 Medium Tier 2 Supermarket Type2 3895.80
24 13.32 Regular 0.007 Snack Foods 204.53 OUT001 1987 High Tier 2 Supermarket Type1 4780.97
In [26]:
display(data.tail(25))
Product_Weight Product_Sugar_Content Product_Allocated_Area Product_Type Product_MRP Store_Id Store_Establishment_Year Store_Size Store_Location_City_Type Store_Type Product_Store_Sales_Total
8738 8.41 Low Sugar 0.038 Dairy 125.52 OUT002 1998 Small Tier 3 Food Mart 2054.56
8739 12.29 No Sugar 0.061 Others 95.98 OUT001 1987 High Tier 2 Supermarket Type1 2327.53
8740 9.20 Low Sugar 0.054 Dairy 84.21 OUT002 1998 Small Tier 3 Food Mart 1387.57
8741 9.66 No Sugar 0.062 Health and Hygiene 178.03 OUT004 2009 Medium Tier 2 Supermarket Type2 3412.00
8742 15.13 Regular 0.125 Meat 203.55 OUT003 1999 Medium Tier 1 Departmental Store 5168.94
8743 13.63 No Sugar 0.035 Health and Hygiene 194.16 OUT001 1987 High Tier 2 Supermarket Type1 4638.82
8744 13.34 Regular 0.022 Seafood 144.41 OUT004 2009 Medium Tier 2 Supermarket Type2 3554.90
8745 11.47 Regular 0.115 Canned 109.47 OUT001 1987 High Tier 2 Supermarket Type1 2418.23
8746 14.72 Regular 0.033 Snack Foods 183.18 OUT001 1987 High Tier 2 Supermarket Type1 4659.17
8747 10.97 Low Sugar 0.033 Seafood 141.62 OUT004 2009 Medium Tier 2 Supermarket Type2 2961.60
8748 10.74 Low Sugar 0.018 Starchy Foods 136.63 OUT004 2009 Medium Tier 2 Supermarket Type2 2807.95
8749 9.20 No Sugar 0.050 Others 96.71 OUT002 1998 Small Tier 3 Food Mart 1644.40
8750 12.96 Regular 0.049 Dairy 150.67 OUT004 2009 Medium Tier 2 Supermarket Type2 3597.07
8751 11.98 Regular 0.073 Fruits and Vegetables 171.04 OUT004 2009 Medium Tier 2 Supermarket Type2 3792.49
8752 10.06 Low Sugar 0.127 Snack Foods 93.41 OUT001 1987 High Tier 2 Supermarket Type1 3998.13
8753 12.71 Low Sugar 0.039 Fruits and Vegetables 154.18 OUT004 2009 Medium Tier 2 Supermarket Type2 3611.62
8754 7.26 Low Sugar 0.054 Fruits and Vegetables 128.93 OUT002 1998 Small Tier 3 Food Mart 1175.03
8755 14.46 Low Sugar 0.030 Baking Goods 156.51 OUT001 1987 High Tier 2 Supermarket Type1 4056.67
8756 11.15 Low Sugar 0.096 Fruits and Vegetables 156.00 OUT004 2009 Medium Tier 2 Supermarket Type2 3297.99
8757 14.93 No Sugar 0.047 Household 202.51 OUT003 1999 Medium Tier 1 Departmental Store 4663.65
8758 14.80 No Sugar 0.016 Health and Hygiene 140.53 OUT004 2009 Medium Tier 2 Supermarket Type2 3806.53
8759 14.06 No Sugar 0.142 Household 144.51 OUT004 2009 Medium Tier 2 Supermarket Type2 5020.74
8760 13.48 No Sugar 0.017 Health and Hygiene 88.58 OUT001 1987 High Tier 2 Supermarket Type1 2443.42
8761 13.89 No Sugar 0.193 Household 168.44 OUT001 1987 High Tier 2 Supermarket Type1 4171.82
8762 14.73 Low Sugar 0.177 Snack Foods 224.93 OUT002 1998 Small Tier 3 Food Mart 2186.08

Exploratory Data Analysis (EDA)

Utility Functions

In [27]:
# function to review a combination of boxplots and histograms
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram


# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

def bin_categorical(series, threshold=0.10):
  value_counts = series.value_counts(normalize=True)
  major = value_counts[value_counts >= threshold].index
  return series.apply(lambda x: x if x in major else 'Other')

Proportions for numerical columns

In [28]:
for col in numerical_cols:
  display(data[col].value_counts(1))
proportion
Product_Weight
12.92 0.003081
11.43 0.002967
12.95 0.002739
13.23 0.002739
11.15 0.002625
... ...
18.01 0.000114
18.73 0.000114
7.36 0.000114
20.31 0.000114
6.89 0.000114

1113 rows × 1 columns


proportion
Product_Allocated_Area
0.021 0.018144
0.031 0.015862
0.026 0.015063
0.035 0.014835
0.039 0.014721
... ...
0.239 0.000114
0.298 0.000114
0.290 0.000114
0.251 0.000114
0.187 0.000114

228 rows × 1 columns


proportion
Product_MRP
160.78 0.000685
131.62 0.000571
145.62 0.000571
138.77 0.000571
165.24 0.000571
... ...
115.47 0.000114
133.13 0.000114
86.91 0.000114
82.03 0.000114
171.14 0.000114

6100 rows × 1 columns


proportion
Store_Establishment_Year
2009 0.533607
1987 0.180988
1999 0.153943
1998 0.131462

proportion
Product_Store_Sales_Total
3511.58 0.000342
5722.13 0.000228
2804.28 0.000228
3907.49 0.000228
2627.24 0.000228
... ...
3507.04 0.000114
3604.94 0.000114
4015.69 0.000114
4622.12 0.000114
2243.85 0.000114

8668 rows × 1 columns


Proportions for categorical columns

In [29]:
for col in categorical_cols:
  display(data[col].value_counts(1))
proportion
Product_Sugar_Content
Low Sugar 0.557457
Regular 0.256875
No Sugar 0.173342
reg 0.012325

proportion
Product_Type
Fruits and Vegetables 0.142531
Snack Foods 0.131119
Frozen Foods 0.092548
Dairy 0.090836
Household 0.084446
Baking Goods 0.081707
Canned 0.077257
Health and Hygiene 0.071665
Meat 0.070524
Soft Drinks 0.059226
Breads 0.022823
Hard Drinks 0.021226
Others 0.017232
Starchy Foods 0.016090
Breakfast 0.012096
Seafood 0.008673

proportion
Store_Id
OUT004 0.533607
OUT001 0.180988
OUT003 0.153943
OUT002 0.131462

proportion
Store_Size
Medium 0.687550
High 0.180988
Small 0.131462

proportion
Store_Location_City_Type
Tier 2 0.714595
Tier 1 0.153943
Tier 3 0.131462

proportion
Store_Type
Supermarket Type2 0.533607
Supermarket Type1 0.180988
Departmental Store 0.153943
Food Mart 0.131462

Normalize the Sugar Content field before encoding or modeling

In [30]:
data['Product_Sugar_Content'] = data['Product_Sugar_Content'].replace({
    'reg': 'Regular',
    'REG': 'Regular',
    'regular': 'Regular'
})

xgb_data['Product_Sugar_Content'] = data['Product_Sugar_Content'].replace({
    'reg': 'Regular',
    'REG': 'Regular',
    'regular': 'Regular'
})

Univariate Histogram + Boxplot Analysis

In [31]:
for col in numerical_cols:
  histogram_boxplot(data, col)

Univariate Labeled Barplot

In [32]:
for col in categorical_cols:
  print("\n")
  labeled_barplot(data, col)
  print("\n")

















Define the target variable for the regression task

In [33]:
target = 'Product_Store_Sales_Total'

Smart binning strategy for high-cardinality numerical columns

In [34]:
def binning(series, bins=15):
    return pd.cut(series, bins=bins)

for col in categorical_cols:
    plt.figure(figsize=(14, 6))
    sns.countplot(x=col, data=data, order=data[col].value_counts().index)
    plt.title(f'Count of each category in {col}')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

for col in numerical_cols:
    if col != target:  # Avoid redundant plot if already visualized elsewhere
        plt.figure(figsize=(14, 6))
        binned_col = binning(data[col], bins=20)
        sns.countplot(x=binned_col, data=data)
        plt.title(f'Binned Count Distribution of {col}')
        plt.xticks(rotation=90)
        plt.tight_layout()
        plt.show()

Univariate Analysis Observations

  • Product_MRP and Product_Weight distributions are right-skewed, indicating the presence of premium or high-weight outliers.
  • Store_Establishment_Year has a discrete spread, suggesting that store maturity might influence regional performance or inventory capacity.
  • Product_Sugar_Content required label normalization (e.g., mapping 'reg' to 'Regular'), highlighting the importance of data consistency.
  • High frequency of the 'Low Sugar' category suggests health-conscious trends or merchandising decisions driven by consumer demand.

Bivariate Analysis

Bivariate Boxplot

In [35]:
for col in numerical_cols:
    if col != target:
        plt.figure(figsize=(15, 6))
        binned_col = binning(data[col])
        sns.boxplot(x=binned_col, y=target, data=data)
        plt.title(f'{col} (binned) vs {target}')
        plt.xticks(rotation=45)
        plt.tight_layout()
        plt.show()

Numerical columns V.S Target Scatterplots

In [36]:
# Scatterplot
for col in numerical_cols:
    if col != target:
        plt.figure(figsize=(15, 6))
        sns.scatterplot(x=col, y=target, data=data)
        plt.title(f'{col} vs {target}')
        plt.tight_layout()
        plt.show()

Bivariate Analysis Observations

  • Product_MRP and Product_Store_Sales_Total show a strong positive relationship, reinforcing the pricing impact on revenue.
  • Product_Weight also correlates positively with revenue, implying that bulk or high-mass products contribute more to sales totals.
  • Boxplots reveal that product category types (e.g., Snack Foods, Household) vary widely in revenue impact, showing potential for strategic SKU prioritization.
  • Store_Size and Store_Type appear to have stratified impacts on sales, though variance remains within category levels.

Multivariate Analysis

Heatmap for numerical columns

In [37]:
sns.set(rc={'figure.figsize':(16,10)})
sns.heatmap(data.corr(numeric_only = True),
            annot=True,
            linewidths=.5,
            center=0,
            cbar=False,
            cmap="Spectral")
plt.show()

Heatmap Observations

  • Significant Correlations
Feature Pair Correlation Interpretation
Product_MRP ↔ Product_Store_Sales_Total 0.79 Strong positive linear correlation. Higher MRP is associated with higher total sales. Useful for prediction.
Product_Weight ↔ Product_Store_Sales_Total 0.74 Heavier products are linked with higher sales—likely due to bulk or premium status.
Product_Weight ↔ Product_MRP 0.53 Moderately correlated. Heavier items may be priced higher. Possible multicollinearity warning.

  • Insignificant or Weak Correlations
Feature Pair Correlation Observation
Product_Allocated_Area ↔ All Other Features ~0.00 No correlation. Likely independent. Low predictive value unless nonlinear.
Store_Establishment_Year ↔ Product_Store_Sales_Total -0.19 Weak negative correlation. Newer stores may show slightly lower sales.
Product_MRP ↔ Store_Establishment_Year -0.19 Weak negative relationship. Potentially negligible for modeling.

  1. Multicollinearity:

    • Product_Weight and Product_MRP both strongly correlate with Product_Store_Sales_Total.
  2. Feature Redundancy:

    • High pairwise correlations can dominate PCA components.
  3. Model Implications:

    • Tree-based models (e.g., Random Forest, XGBoost) can handle these correlations.
  4. Feature Prioritization:

    • High value: Product_MRP, Product_Weight
    • Low value: Product_Allocated_Area, Store_Establishment_Year

Categorical columns V.S Target Boxplots

In [38]:
# Pairplot
# Get the name of the target Series
target_column = target.name if isinstance(target, pd.Series) else target

# Now form the correct DataFrame slice
pairplot_data = data[numerical_cols ]

for col in pairplot_data.columns:
    print(f"{col}: {pairplot_data[col].shape} | type: {type(pairplot_data[col].values)}")

# Boxplot
for col in categorical_cols:
    plt.figure(figsize=(14, 6))
    sns.boxplot(x=col, y=target, data=data)
    plt.title(f'{col} vs {target}')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
Product_Weight: (8763,) | type: <class 'numpy.ndarray'>
Product_Allocated_Area: (8763,) | type: <class 'numpy.ndarray'>
Product_MRP: (8763,) | type: <class 'numpy.ndarray'>
Store_Establishment_Year: (8763,) | type: <class 'numpy.ndarray'>
Product_Store_Sales_Total: (8763,) | type: <class 'numpy.ndarray'>

Colored binning of Scatterplot data

In [39]:
plt.figure(figsize=(8,6))
plt.scatter(
    x=pairplot_data['Product_Weight'],
    y=pairplot_data['Product_MRP'],
    c=pairplot_data['Product_Store_Sales_Total'],
    cmap='viridis',
    alpha=0.5
)
plt.colorbar(label='Sales Revenue')
plt.xlabel('Product Weight')
plt.ylabel('Product MRP')
plt.title('MRP vs Weight colored by Revenue')
plt.tight_layout()
plt.show()

Colored Pairplots of our Target

In [40]:
# Create a categorical bin of our continuous target
pairplot_data['Revenue_Bin'] = pd.qcut(pairplot_data['Product_Store_Sales_Total'], q=4, labels=["Low", "Med-Low", "Med-High", "High"])

# Plot with color by bin
sns.pairplot(pairplot_data, diag_kind='kde', hue='Revenue_Bin', plot_kws={'alpha': 0.6})
plt.suptitle("Pairplot Colored by Revenue Bin", y=1.02)
plt.show()

Multivariate Analysis

  • Pairplot analysis confirms correlated behavior between Product_Weight, Product_MRP, and Product_Store_Sales_Total.
  • There is no strong evidence of multicollinearity; feature independence is reasonably preserved, which is favorable for regression interpretability.
  • Visual binning of the revenue variable reveals natural clustering in product characteristics, particularly in high-revenue segments where high MRP and weight dominate.
  • Store_Establishment_Year does not show a strong multivariate interaction effect, aligning with its lower feature importance score in model evaluation.

Data Preprocessing

In [41]:
# Define predictor matrix (X) using selected numeric and categorical features
X = data[numerical_cols + categorical_cols]

# Define target variable
y = data[target]
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,              # Predictors (X) and target variable (y)
    test_size=0.2,     # 20% of the data is reserved for testing
    random_state=33    # Ensures reproducibility by setting a fixed random seed
)
In [42]:
# Create a preprocessing pipeline for numerical and categorical features

# Feature groups
numerical_columns = ['Product_Weight', 'Product_MRP']
onehot_cols = ['Product_Type', 'Store_Type', 'Store_Location_City_Type']
ordinal_cols = ['Store_Size', 'Product_Sugar_Content']

# Define custom orderings for ordinal encoding
store_size_order = ['Small', 'Medium', 'High']
sugar_content_order = ['No Sugar', 'Low Sugar', 'Regular']
ordinal_categories = [store_size_order, sugar_content_order]

# Define pipeline for numerical columns
numerical_pipeline = Pipeline([
    ('num_imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Define pipeline for ordinal columns
ordinal_pipeline = Pipeline([
    ('ord_imputer', SimpleImputer(strategy='most_frequent')),
    ('ordinal', OrdinalEncoder(categories=ordinal_categories))
])

# Define pipeline for one-hot columns
onehot_pipeline = Pipeline([
    ('cat_imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine into a single ColumnTransformer
preprocessor = make_column_transformer(
    (numerical_pipeline, numerical_columns),
    (ordinal_pipeline, ordinal_cols),
    (onehot_pipeline, onehot_cols)
)

Model Building

Define functions for Model Evaluation

In [43]:
# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]
    k = predictors.shape[1]
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))


# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
    """
    Function to compute different metrics to check regression model performance

    model: regressor
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    r2 = r2_score(target, pred)  # to compute R-squared
    adjr2 = adj_r2_score(predictors, target, pred)  # to compute adjusted R-squared
    rmse = np.sqrt(mean_squared_error(target, pred))  # to compute RMSE
    mae = mean_absolute_error(target, pred)  # to compute MAE
    mape = mean_absolute_percentage_error(target, pred)  # to compute MAPE

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "RMSE": rmse,
            "MAE": mae,
            "R-squared": r2,
            "Adj. R-squared": adjr2,
            "MAPE": mape,
        },
        index=[0],
    )

    return df_perf

The ML models to be built can be any two out of the following:

  1. Decision Tree
  2. Bagging
  3. Random Forest
  4. AdaBoost
  5. Gradient Boosting
  6. XGBoost

Random Forest Regressor

In [44]:
# Define base Random Forest model
rf_model = RandomForestRegressor(random_state=33)
# Create pipeline with preprocessing and Random Forest model
rf_pipeline = make_pipeline(preprocessor, rf_model)
# Train the model pipeline on the training data
rf_pipeline.fit(X_train, y_train)
Out[44]:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline-1',
                                                  Pipeline(steps=[('num_imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['Product_Weight',
                                                   'Product_MRP']),
                                                 ('pipeline-2',
                                                  Pipeline(steps=[('ord_imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('ordinal',
                                                                   OrdinalEncoder(categories=[['Small',
                                                                                               'Medi...
                                                                                               'High'],
                                                                                              ['No '
                                                                                               'Sugar',
                                                                                               'Low '
                                                                                               'Sugar',
                                                                                               'Regular']]))]),
                                                  ['Store_Size',
                                                   'Product_Sugar_Content']),
                                                 ('pipeline-3',
                                                  Pipeline(steps=[('cat_imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Type', 'Store_Type',
                                                   'Store_Location_City_Type'])])),
                ('randomforestregressor',
                 RandomForestRegressor(random_state=33))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
In [45]:
rf_estimator_model_train_perf = model_performance_regression(rf_pipeline, X_train,y_train)
print("Training performance \n")
rf_estimator_model_train_perf
Training performance 

Out[45]:
RMSE MAE R-squared Adj. R-squared MAPE
0 108.386697 40.019683 0.989717 0.989701 0.015036
In [46]:
rf_estimator_model_test_perf = model_performance_regression(rf_pipeline, X_test,y_test)
print("Testing performance \n")
rf_estimator_model_test_perf
Testing performance 

Out[46]:
RMSE MAE R-squared Adj. R-squared MAPE
0 282.36374 100.810534 0.927921 0.927466 0.03839

XGBoost Regressor

In [47]:
# Define base XGBoost model
xgb_model = XGBRegressor(random_state=33, tree_method='gpu_hist', predictor='gpu_predictor')
# Create pipeline with preprocessing and XGBoost model
xgb_pipeline = make_pipeline(preprocessor, xgb_model)
# Train the model pipeline on the training data
xgb_pipeline.fit(X_train, y_train)
Out[47]:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline-1',
                                                  Pipeline(steps=[('num_imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['Product_Weight',
                                                   'Product_MRP']),
                                                 ('pipeline-2',
                                                  Pipeline(steps=[('ord_imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('ordinal',
                                                                   OrdinalEncoder(categories=[['Small',
                                                                                               'Medi...
                              feature_types=None, gamma=None, grow_policy=None,
                              importance_type=None,
                              interaction_constraints=None, learning_rate=None,
                              max_bin=None, max_cat_threshold=None,
                              max_cat_to_onehot=None, max_delta_step=None,
                              max_depth=None, max_leaves=None,
                              min_child_weight=None, missing=nan,
                              monotone_constraints=None, multi_strategy=None,
                              n_estimators=None, n_jobs=None,
                              num_parallel_tree=None, predictor='gpu_predictor', ...))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
In [48]:
xgb_estimator_model_train_perf = model_performance_regression(xgb_pipeline, X_train, y_train)
print("Training performance \n")
xgb_estimator_model_train_perf
Training performance 

Out[48]:
RMSE MAE R-squared Adj. R-squared MAPE
0 138.439099 64.075723 0.983225 0.983198 0.022897
In [49]:
xgb_estimator_model_test_perf = model_performance_regression(xgb_pipeline, X_test,y_test)
print("Testing performance \n")
xgb_estimator_model_test_perf
Testing performance 

Out[49]:
RMSE MAE R-squared Adj. R-squared MAPE
0 298.176312 124.69579 0.919622 0.919114 0.048048

Model Performance Improvement - Hyperparameter Tuning

Random Forest Tuned Regressor

In [50]:
# Choose the type of classifier.
rf_tuned = RandomForestRegressor(random_state=33)

# Create pipeline with preprocessing and XGBoost model
rf_pipeline = make_pipeline(preprocessor, rf_tuned)

# Grid of parameters to choose from
parameters = parameters = {
    'randomforestregressor__max_depth':[3, 4, 5, 6],
    'randomforestregressor__max_features': ['sqrt','log2',None],
    'randomforestregressor__n_estimators': [50, 75, 100, 125, 150]
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.r2_score)

# Run the grid search
grid_obj = GridSearchCV(rf_pipeline, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
rf_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
rf_tuned.fit(X_train, y_train)
Out[50]:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline-1',
                                                  Pipeline(steps=[('num_imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['Product_Weight',
                                                   'Product_MRP']),
                                                 ('pipeline-2',
                                                  Pipeline(steps=[('ord_imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('ordinal',
                                                                   OrdinalEncoder(categories=[['Small',
                                                                                               'Medi...
                                                  ['Store_Size',
                                                   'Product_Sugar_Content']),
                                                 ('pipeline-3',
                                                  Pipeline(steps=[('cat_imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Type', 'Store_Type',
                                                   'Store_Location_City_Type'])])),
                ('randomforestregressor',
                 RandomForestRegressor(max_depth=6, max_features=None,
                                       n_estimators=150, random_state=33))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
In [51]:
rf_tuned_model_train_perf = model_performance_regression(rf_tuned, X_train, y_train)
print("Training performance \n")
rf_tuned_model_train_perf
Training performance 

Out[51]:
RMSE MAE R-squared Adj. R-squared MAPE
0 290.518889 153.679679 0.926124 0.926008 0.055117
In [52]:
rf_tuned_model_test_perf = model_performance_regression(rf_tuned, X_test, y_test)
print("Testing performance \n")
rf_tuned_model_test_perf
Testing performance 

Out[52]:
RMSE MAE R-squared Adj. R-squared MAPE
0 295.692691 153.146544 0.920955 0.920456 0.057907

Random Forest - Feature importances

In [53]:
# Get feature importances
rf_model_from_pipeline = rf_tuned.named_steps['randomforestregressor']
importances = rf_model_from_pipeline.feature_importances_

# Get the feature names from the preprocessor after transformation
# This is necessary because one-hot encoding creates new columns
preprocessor_step = rf_tuned.named_steps['columntransformer']

# Get the names of the transformed features
# The get_feature_names_out() method is available in newer versions of scikit-learn
try:
    features_after_preprocessing = preprocessor_step.get_feature_names_out()
except AttributeError:
    # Fallback for older scikit-learn versions if get_feature_names_out() is not available
    # This requires inspecting the transformer steps and their output shapes
    print("Warning: scikit-learn version might be old. Consider upgrading for get_feature_names_out().")
    # Attempt to manually construct feature names (less robust)
    numerical_features = preprocessor_step.transformers_[0][2]
    ordinal_features = preprocessor_step.transformers_[1][2]
    onehot_encoder = preprocessor_step.transformers_[2][1].named_steps['onehot']
    onehot_feature_names = onehot_encoder.get_feature_names_out(preprocessor_step.transformers_[2][2])
    features_after_preprocessing = np.concatenate([numerical_features, ordinal_features, onehot_feature_names])


# Ensure the number of feature names matches the number of importances
if len(features_after_preprocessing) != len(importances):
    print(f"Mismatch: Found {len(features_after_preprocessing)} feature names but {len(importances)} importances.")
    print("Check your preprocessing steps and scikit-learn version.")
    # Handle the mismatch, maybe by raising an error or skipping plotting
    # For now, we will proceed assuming the corrected feature names match importances
else:
    features = features_after_preprocessing

# Create DataFrame for sorting
feat_imp_df = pd.DataFrame({
    'Feature': features,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

# Plot
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feat_imp_df.head(15), palette='viridis')
plt.title("Top 15 Feature Importances (Random Forest)")
plt.tight_layout()
plt.show()

Random Forest - Residual Plots

In [54]:
# Predictions and residuals
y_pred = rf_tuned.predict(X_test)
residuals = y_test - y_pred

# Residuals vs Predicted
plt.figure(figsize=(8, 6))
sns.scatterplot(x=y_pred, y=residuals, alpha=0.5)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Values")
plt.tight_layout()
plt.show()

# Distribution of residuals
plt.figure(figsize=(8, 6))
sns.histplot(residuals, kde=True, bins=30)
plt.axvline(0, color='red', linestyle='--')
plt.title("Distribution of Residuals")
plt.xlabel("Residual")
plt.tight_layout()
plt.show()

XGB Tuned Regressor

In [55]:
# Choose the type of classifier.
xgb_tuned = XGBRegressor(
    random_state=33, tree_method='gpu_hist',
    predictor='gpu_predictor',  n_estimators=100,
    max_depth=6, learning_rate=0.1, objective='reg:squarederror')

# Create pipeline with preprocessing and XGBoost model
xgb_pipeline = make_pipeline(preprocessor, xgb_tuned)

#Grid of parameters to choose from
param_grid = {
    'xgbregressor__n_estimators': [50, 100, 150, 200],    # number of trees to build
    'xgbregressor___max_depth': [2, 3, 4],    # maximum depth of each tree
    'xgbregressor___colsample_bytree': [0.4, 0.5, 0.6],    # percentage of attributes to be considered (randomly) for each tree
    'xgbregressor___colsample_bylevel': [0.4, 0.5, 0.6],    # percentage of attributes to be considered (randomly) for each level of a tree
    'xgbregressor___learning_rate': [0.01, 0.05, 0.1],    # learning rate
    'xgbregressor___reg_lambda': [0.4, 0.5, 0.6],    # L2 regularization factor
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.r2_score)

# Run the grid search
grid_obj = GridSearchCV(xgb_pipeline, param_grid, scoring=scorer,cv=5,n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
xgb_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
xgb_tuned.fit(X_train, y_train)
Out[55]:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline-1',
                                                  Pipeline(steps=[('num_imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['Product_Weight',
                                                   'Product_MRP']),
                                                 ('pipeline-2',
                                                  Pipeline(steps=[('ord_imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('ordinal',
                                                                   OrdinalEncoder(categories=[['Small',
                                                                                               'Medi...
                              device=None, early_stopping_rounds=None,
                              enable_categorical=False, eval_metric=None,
                              feature_types=None, gamma=None, grow_policy=None,
                              importance_type=None,
                              interaction_constraints=None, learning_rate=0.1,
                              max_bin=None, max_cat_threshold=None,
                              max_cat_to_onehot=None, max_delta_step=None,
                              max_depth=6, max_leaves=None,
                              min_child_weight=None, missing=nan,
                              monotone_constraints=None, ...))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Training performance - Untuned - Tuned

In [56]:
xgb_tuned_model_train_perf = model_performance_regression(xgb_tuned, X_train, y_train)
print("Training performance \n")
xgb_tuned_model_train_perf
Training performance 

Out[56]:
RMSE MAE R-squared Adj. R-squared MAPE
0 237.245926 95.915174 0.950733 0.950656 0.036107

Testing peformance - Untuned - Tuned

In [57]:
xgb_tuned_model_test_perf = model_performance_regression(xgb_tuned, X_test, y_test)
print("Testing performance \n")
xgb_tuned_model_test_perf
Testing performance 

Out[57]:
RMSE MAE R-squared Adj. R-squared MAPE
0 277.815538 109.176188 0.930224 0.929783 0.043303

Hypertune binning + dummies - xgb_final data copy

In [58]:
xgb_data['Product_MRP_bin'] = pd.qcut(xgb_data['Product_MRP'], q=5, duplicates='drop')
xgb_data['Product_Weight_bin'] = pd.qcut(xgb_data['Product_Weight'], q=5, duplicates='drop')

xgb_data['Product_Type_Binned'] = bin_categorical(xgb_data['Product_Type'])
xgb_data['Store_Type_Binned'] = bin_categorical(xgb_data['Store_Type'])
xgb_data['Store_Location_City_Type_Binned'] = bin_categorical(xgb_data['Store_Location_City_Type'])


# Now apply one-hot encoding to the original and newly binned categorical columns
xgb_data = pd.get_dummies(xgb_data, columns=[
    'Product_Type_Binned', 'Store_Type_Binned', 'Store_Location_City_Type_Binned',
    'Product_MRP_bin', 'Product_Weight_bin'
], drop_first=False)
In [59]:
print("Columns in xgb_data:")
print(xgb_data.columns.tolist())
Columns in xgb_data:
['Product_Weight', 'Product_Sugar_Content', 'Product_Allocated_Area', 'Product_Type', 'Product_MRP', 'Store_Id', 'Store_Establishment_Year', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Product_Store_Sales_Total', 'Product_Type_Binned_Fruits and Vegetables', 'Product_Type_Binned_Other', 'Product_Type_Binned_Snack Foods', 'Store_Type_Binned_Departmental Store', 'Store_Type_Binned_Food Mart', 'Store_Type_Binned_Supermarket Type1', 'Store_Type_Binned_Supermarket Type2', 'Store_Location_City_Type_Binned_Tier 1', 'Store_Location_City_Type_Binned_Tier 2', 'Store_Location_City_Type_Binned_Tier 3', 'Product_MRP_bin_(30.999, 121.024]', 'Product_MRP_bin_(121.024, 139.39]', 'Product_MRP_bin_(139.39, 154.356]', 'Product_MRP_bin_(154.356, 172.81]', 'Product_MRP_bin_(172.81, 266.0]', 'Product_Weight_bin_(3.999, 10.79]', 'Product_Weight_bin_(10.79, 12.08]', 'Product_Weight_bin_(12.08, 13.212]', 'Product_Weight_bin_(13.212, 14.516]', 'Product_Weight_bin_(14.516, 22.0]']
In [60]:
print("Number of features after encoding:", xgb_data.shape[1])
Number of features after encoding: 31
In [61]:
# Identify new encoded columns
encoded_cols = [col for col in xgb_data.columns if '_Binned' in col or '_bin_' in col]

# Filter only columns with more than one unique value
plottable_cols = [col for col in encoded_cols if xgb_data[col].nunique() > 1]

print(f"{len(plottable_cols)} of {len(encoded_cols)} dummy features have more than one unique value.")

# Skip if nothing to plot
if not plottable_cols:
    print("No dummy columns with variance to plot.")
else:
    # Prepare dynamic layout
    ncols = 3
    nrows = (len(plottable_cols) + ncols - 1) // ncols
    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(5*ncols, 4*nrows))
    axes = axes.flatten()

    for idx, col in enumerate(plottable_cols):
        sns.histplot(data=xgb_data, x=col, bins=2, ax=axes[idx])
        axes[idx].set_title(f"{col} Distribution")

    # Remove unused axes
    for j in range(idx+1, len(axes)):
        fig.delaxes(axes[j])

    plt.tight_layout()
    plt.show()
20 of 20 dummy features have more than one unique value.
In [62]:
# View dummy-encoded columns
encoded_cols = [col for col in xgb_data.columns if '_Binned_' in col or '_bin_' in col]
print("New dummy-encoded features:")
print(encoded_cols)
New dummy-encoded features:
['Product_Type_Binned_Fruits and Vegetables', 'Product_Type_Binned_Other', 'Product_Type_Binned_Snack Foods', 'Store_Type_Binned_Departmental Store', 'Store_Type_Binned_Food Mart', 'Store_Type_Binned_Supermarket Type1', 'Store_Type_Binned_Supermarket Type2', 'Store_Location_City_Type_Binned_Tier 1', 'Store_Location_City_Type_Binned_Tier 2', 'Store_Location_City_Type_Binned_Tier 3', 'Product_MRP_bin_(30.999, 121.024]', 'Product_MRP_bin_(121.024, 139.39]', 'Product_MRP_bin_(139.39, 154.356]', 'Product_MRP_bin_(154.356, 172.81]', 'Product_MRP_bin_(172.81, 266.0]', 'Product_Weight_bin_(3.999, 10.79]', 'Product_Weight_bin_(10.79, 12.08]', 'Product_Weight_bin_(12.08, 13.212]', 'Product_Weight_bin_(13.212, 14.516]', 'Product_Weight_bin_(14.516, 22.0]']
In [63]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14, 5))

# Histogram + KDE
sns.histplot(data=xgb_data, x='Product_Store_Sales_Total', kde=True, bins=30, ax=axes[0])
axes[0].set_title("Distribution of Product_Store_Sales_Total")
axes[0].set_xlabel("Sales Total")
axes[0].set_ylabel("Frequency")

# Boxplot
sns.boxplot(data=xgb_data, x='Product_Store_Sales_Total', ax=axes[1])
axes[1].set_title("Boxplot of Product_Store_Sales_Total")
axes[1].set_xlabel("Sales Total")

plt.tight_layout()
plt.show()
In [64]:
# Confirm this is coming from the preprocessed dummy-encoded DataFrame
# When preparing data for training:
X = xgb_data.drop(columns=[
    'Product_Store_Sales_Total',
    'Product_Type', 'Store_Type', 'Store_Location_City_Type',  # raw categoricals
    'Product_MRP', 'Product_Weight' # raw numerics if binned
])

y = xgb_data['Product_Store_Sales_Total']

# Drop any residual object-type columns if they slipped in
X = X.select_dtypes(include=['number']).copy()

# Create binned version of target for stratification
y_binned = pd.qcut(y, q=10, labels=False, duplicates='drop')  # Adjust `q` if needed

# Stratified train-test split (using binned target)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=33, stratify=y_binned
)

XGB_Plus Regressor - custom pipeline - binning

In [65]:
xgb_plus = XGBRegressor(
    random_state=33, tree_method='gpu_hist',
    predictor='gpu_predictor',  n_estimators=100,
    max_depth=6, learning_rate=0.1, objective='reg:squarederror')

# Create pipeline (no preprocessor since data is already processed)
xgb_pipeline = Pipeline([
    ('xgbregressor', xgb_plus)
])

# Run the grid search
grid_search = GridSearchCV(xgb_pipeline, param_grid, scoring=scorer,cv=5,n_jobs=-1)
grid_search = grid_search.fit(X_train, y_train)

# Best model
xgb_final = grid_search.best_estimator_

# Final training for safety
xgb_final.fit(X_train, y_train)
Out[65]:
Pipeline(steps=[('xgbregressor',
                 XGBRegressor(_colsample_bylevel=0.4, _colsample_bytree=0.4,
                              _learning_rate=0.01, _max_depth=2,
                              _reg_lambda=0.4, base_score=None, booster=None,
                              callbacks=None, colsample_bylevel=None,
                              colsample_bynode=None, colsample_bytree=None,
                              device=None, early_stopping_rounds=None,
                              enable_categorical=False, eval_metric=None,
                              feature_types=None, gamma=None, grow_policy=None,
                              importance_type=None,
                              interaction_constraints=None, learning_rate=0.1,
                              max_bin=None, max_cat_threshold=None,
                              max_cat_to_onehot=None, max_delta_step=None,
                              max_depth=6, max_leaves=None,
                              min_child_weight=None, missing=nan,
                              monotone_constraints=None, ...))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

xgb_final - Training and Test Performance

In [66]:
xgb_final_train_perf = model_performance_regression(xgb_final, X_train, y_train)
xgb_final_test_perf = model_performance_regression(xgb_final, X_test, y_test)

print("Train:")
display(xgb_final_train_perf)
print("Test:")
display(xgb_final_test_perf)
Train:
RMSE MAE R-squared Adj. R-squared MAPE
0 589.761623 462.309107 0.69546 0.695373 0.169177
Test:
RMSE MAE R-squared Adj. R-squared MAPE
0 594.757607 468.605728 0.680962 0.680597 0.164956

Model Performance Comparison, Final Model Selection, and Serialization

In [67]:
# Training performance comparison

models_train_comp_df = pd.concat(
    [rf_estimator_model_train_perf.T,rf_tuned_model_train_perf.T,
    xgb_estimator_model_train_perf.T,xgb_tuned_model_train_perf.T, xgb_final_train_perf.T],
    axis=1,
)

models_train_comp_df.columns = [
    "Random Forest Estimator",
    "Random Forest Tuned",
    "XGBoost",
    "XGBoost Tuned",
    "XGBoost Final",
]

print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[67]:
Random Forest Estimator Random Forest Tuned XGBoost XGBoost Tuned XGBoost Final
RMSE 108.386697 290.518889 138.439099 237.245926 589.761623
MAE 40.019683 153.679679 64.075723 95.915174 462.309107
R-squared 0.989717 0.926124 0.983225 0.950733 0.695460
Adj. R-squared 0.989701 0.926008 0.983198 0.950656 0.695373
MAPE 0.015036 0.055117 0.022897 0.036107 0.169177
In [68]:
# Testing performance comparison

models_test_comp_df = pd.concat(
    [rf_estimator_model_test_perf.T,rf_tuned_model_test_perf.T,
    xgb_estimator_model_test_perf.T,xgb_tuned_model_test_perf.T, xgb_final_test_perf.T],
    axis=1,
)

models_test_comp_df.columns = [
    "Random Forest Estimator",
    "Random Forest Tuned",
    "XGBoost",
    "XGBoost Tuned",
    "XGBoost_Final",
]

print("Testing performance comparison:")
models_test_comp_df
Testing performance comparison:
Out[68]:
Random Forest Estimator Random Forest Tuned XGBoost XGBoost Tuned XGBoost_Final
RMSE 282.363740 295.692691 298.176312 277.815538 594.757607
MAE 100.810534 153.146544 124.695790 109.176188 468.605728
R-squared 0.927921 0.920955 0.919622 0.930224 0.680962
Adj. R-squared 0.927466 0.920456 0.919114 0.929783 0.680597
MAPE 0.038390 0.057907 0.048048 0.043303 0.164956
In [69]:
(models_train_comp_df - models_test_comp_df).iloc[2]
Out[69]:
R-squared
Random Forest Estimator 0.061796
Random Forest Tuned 0.005168
XGBoost 0.063603
XGBoost Final NaN
XGBoost Tuned 0.020509
XGBoost_Final NaN

Model Serialization

In [70]:
# Create a folder for storing the files needed for web app deployment
os.makedirs("/content/deployment_files/Model", exist_ok=True)
# Define the file path to save (serialize) the trained model along with the data preprocessing steps
saved_model_path = "/content/deployment_files/Model/store-sales-prediction-model-v1-0.joblib"
# Save the best trained model pipeline using joblib
joblib.dump(xgb_tuned, saved_model_path)

print(f"Model saved successfully at {saved_model_path}")

# Load the saved model pipeline from the file
saved_model = joblib.load("/content/deployment_files/Model/store-sales-prediction-model-v1-0.joblib")

# Confirm the model is loaded
print("Model loaded successfully.")

saved_model
Model saved successfully at /content/deployment_files/Model/store-sales-prediction-model-v1-0.joblib
Model loaded successfully.
Out[70]:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline-1',
                                                  Pipeline(steps=[('num_imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['Product_Weight',
                                                   'Product_MRP']),
                                                 ('pipeline-2',
                                                  Pipeline(steps=[('ord_imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('ordinal',
                                                                   OrdinalEncoder(categories=[['Small',
                                                                                               'Medi...
                              device=None, early_stopping_rounds=None,
                              enable_categorical=False, eval_metric=None,
                              feature_types=None, gamma=None, grow_policy=None,
                              importance_type=None,
                              interaction_constraints=None, learning_rate=0.1,
                              max_bin=None, max_cat_threshold=None,
                              max_cat_to_onehot=None, max_delta_step=None,
                              max_depth=6, max_leaves=None,
                              min_child_weight=None, missing=nan,
                              monotone_constraints=None, ...))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Deployment - Backend

In [71]:
# Google Colab secrets management
access_key = userdata.get('HF_TOKEN')
In [72]:
api = HfApi(token=os.getenv(access_key))
api.upload_folder(
    folder_path="/content/deployment_files/Model",
    repo_id="omoral02/RevenuePrediction",
    repo_type="model",
)
No files have been modified since last commit. Skipping to prevent empty commit.
WARNING:huggingface_hub.hf_api:No files have been modified since last commit. Skipping to prevent empty commit.
Out[72]:
CommitInfo(commit_url='https://huggingface.co/omoral02/RevenuePrediction/commit/46fca53ebf65596b93d5477fdb3e054e9bb22fef', commit_message='Upload folder using huggingface_hub', commit_description='', oid='46fca53ebf65596b93d5477fdb3e054e9bb22fef', pr_url=None, repo_url=RepoUrl('https://huggingface.co/omoral02/RevenuePrediction', endpoint='https://huggingface.co', repo_type='model', repo_id='omoral02/RevenuePrediction'), pr_revision=None, pr_num=None)

Flask Web Framework

In [73]:
os.makedirs("/content/deployment_files/Backend", exist_ok=True)
In [74]:
%%writefile /content/deployment_files/Backend/app.py

# Import necessary libraries
import numpy as np
import joblib
import pandas as pd
from flask import Flask, request, jsonify
from huggingface_hub import hf_hub_download
import joblib
import tempfile
import streamlit as st
import io # Import io module

REPO_ID = "omoral02/RevenuePrediction"
FILENAME = "store-sales-prediction-model-v1-0.joblib"

# Write model to temp directory (writable in Hugging Face Spaces)
temp_dir = tempfile.gettempdir()
model_path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, cache_dir=temp_dir)
model = joblib.load(model_path)

# Initialize the Flask app
superkart_api = Flask("SuperKart Sales Predictor")

# def transform_input_for_model(df_raw):
#     # Binning
#     df_raw['Product_MRP_bin'] = pd.qcut(df_raw['Product_MRP'], q=5, duplicates='drop')
#     df_raw['Product_Weight_bin'] = pd.qcut(df_raw['Product_Weight'], q=5, duplicates='drop')

#     df_raw['Product_Type_Binned'] = bin_categorical(df_raw['Product_Type'])
#     df_raw['Store_Type_Binned'] = bin_categorical(df_raw['Store_Type'])
#     df_raw['Store_Location_City_Type_Binned'] = bin_categorical(df_raw['Store_Location_City_Type'])

#     # Dummy encoding
#     df_encoded = pd.get_dummies(df_raw, columns=[
#         'Product_Type_Binned', 'Store_Type_Binned',
#         'Store_Location_City_Type_Binned',
#         'Product_MRP_bin', 'Product_Weight_bin'
#     ], drop_first=False)

#     # Drop original fields
#     df_encoded = df_encoded.select_dtypes(include=['number']).copy()

#     return df_encoded


# Load the trained model
@st.cache_resource
def load_model():
    return model

model = load_model()

# Define root endpoint
@superkart_api.get('/')
def home():
    return jsonify({"okay": "Welcome to the SuperKart Sales Prediction API!"})

# Endpoint for single record prediction
@superkart_api.post('/v1/predict')
def predict_sales():
    if model is None:
      return jsonify({"error": "Model not loaded"}), 500

    try:
        input_json = request.get_json()
        expected_fields = [
            'Product_Type', 'Store_Type', 'Store_Location_City_Type',
            'Store_Size', 'Product_Sugar_Content', 'Product_Weight',
            'Product_MRP', 'Product_Allocated_Area', 'Store_Establishment_Year'
        ]
        missing = [f for f in expected_fields if f not in input_json]
        if missing:
            return jsonify({
                'error': 'Missing required input fields.',
                'missing_fields': missing,
                'received_fields': list(input_json.keys())
            }), 400

        # Extract relevant inputs (must match training columns)
        features = {
            'Product_Type': input_json['Product_Type'],
            'Store_Type': input_json['Store_Type'],
            'Store_Location_City_Type': input_json['Store_Location_City_Type'],
            'Store_Size': input_json['Store_Size'],
            'Product_Sugar_Content': input_json['Product_Sugar_Content'],
            'Product_Weight': input_json['Product_Weight'],
            'Product_MRP': input_json['Product_MRP'],
            'Product_Allocated_Area': input_json['Product_Allocated_Area'],
            'Store_Establishment_Year': input_json['Store_Establishment_Year'],
        }

        input_df = pd.DataFrame([features])
        # df_transformed = transform_input_for_model(input_df)
        prediction = model.predict(input_df)[0]
        return jsonify({'Predicted_Store_Sales_Total': round(float(prediction), 2)})

    except Exception as e:
        print(f"Error during single prediction: {e}") # Added print for debugging
        return jsonify({"error": str(e), "message": "Prediction failed"}), 500 # Return error message and status code


# Endpoint for batch prediction using CSV
@superkart_api.post('/v1/batch')
def predict_sales_batch():
    try:
        uploaded_file = request.files['file']
        input_df = pd.read_csv(uploaded_file)
        expected_fields = [
            'Product_Type', 'Store_Type', 'Store_Location_City_Type',
            'Store_Size', 'Product_Sugar_Content', 'Product_Weight',
            'Product_MRP', 'Product_Allocated_Area', 'Store_Establishment_Year'
        ]
        missing = [f for f in expected_fields if f not in input_df.columns]
        if missing:
            return jsonify({
                'error': 'Missing required columns in uploaded CSV.',
                'missing_columns': missing,
                'received_columns': list(input_df.columns)
            }), 400
        # df_transformed = transform_input_for_model(input_df)
        predictions = model.predict(input_df).tolist()
        rounded_preds = [round(float(p), 2) for p in predictions]
        return jsonify({'Predicted_Store_Sales_Total': rounded_preds})

        # Optional: use product-store pair if available
        # if 'Product_Id' in df_transformed.columns and 'Store_Id' in df_transformed.columns:
        #     keys = df_transformed['Product_Id'].astype(str) + "_" + df_transformed['Store_Id'].astype(str)
        # else:
        #     keys = [f"row_{i}" for i in range(len(df_transformed))]

        # return jsonify(dict(zip(keys, rounded_preds)))

    except Exception as e:
        print(f"Error during batch prediction: {e}") # Added print for debugging
        return jsonify({"error": str(e), "message": "Prediction failed"}), 500

# Run the Flask app
if __name__ == '__main__':
    superkart_api.run(host="0.0.0.0", port=7860)
Overwriting /content/deployment_files/Backend/app.py

Dependencies File

In [75]:
%%writefile /content/deployment_files/Backend/requirements.txt
pandas==2.2.2
numpy==2.0.2
scikit-learn==1.6.1
xgboost==2.1.4
joblib==1.4.2
Werkzeug==2.2.2
flask==2.2.2
gunicorn==20.1.0
requests==2.28.1
uvicorn[standard]
huggingface_hub==0.20.3
streamlit==1.43.2
Overwriting /content/deployment_files/Backend/requirements.txt

Dockerfile

In [76]:
%%writefile /content/deployment_files/Backend/Dockerfile

# Base image
FROM python:3.9-slim

# Set working directory
WORKDIR /app

# Copy files
COPY . .

# Install dependencies
RUN pip install -r requirements.txt

# Expose port used by Flask
EXPOSE 7860

# Run the app
CMD ["python", "app.py"]
Overwriting /content/deployment_files/Backend/Dockerfile

Setting up a Hugging Face Docker Space for the Backend

In [77]:
# Login to Hugging Face account using access token
login(token=access_key)

# Try to create the repository for the Hugging Face Space
try:
    create_repo("RevenuePredictionBackend",
        repo_type="space",  # Specify the repository type as "space"
        space_sdk="docker",  # Specify the space SDK as "docker" to create a Docker space
        private=False  # Set to True if you want the space to be private
    )
except Exception as e:
    # Handle potential errors during repository creation
    if "RepositoryAlreadyExistsError" in str(e):
        print("Repository already exists. Skipping creation.")
    else:
        print(f"Error creating repository: {e}")
Error creating repository: 409 Client Error: Conflict for url: https://huggingface.co/api/repos/create (Request ID: Root=1-684681ad-7a7ab2fb2342d294258585f9;20e7ff0f-5d2e-4de9-b392-eb6c56ffd32d)

You already created this space repo

Uploading Files to Hugging Face Space (Docker Space)

In [78]:
repo_id = "omoral02/RevenuePredictionBackend"  # Hugging Face space id

# Login to Hugging Face platform with the access token
login(token=access_key)

# Initialize the API
api = HfApi()

# Upload Streamlit app files stored in the folder called deployment_files
api.upload_folder(
    folder_path="/content/deployment_files/Backend",  # Local folder path
    repo_id=repo_id,  # Hugging face space id
    repo_type="space",  # Hugging face repo type "space"
)
Out[78]:
CommitInfo(commit_url='https://huggingface.co/spaces/omoral02/RevenuePredictionBackend/commit/dffad5b69290f9775dfce112406935382a74cbfc', commit_message='Upload folder using huggingface_hub', commit_description='', oid='dffad5b69290f9775dfce112406935382a74cbfc', pr_url=None, repo_url=RepoUrl('https://huggingface.co/spaces/omoral02/RevenuePredictionBackend', endpoint='https://huggingface.co', repo_type='space', repo_id='omoral02/RevenuePredictionBackend'), pr_revision=None, pr_num=None)

Deployment - Frontend

Points to note before executing the below cells

  • Create a Streamlit space on Hugging Face by following the instructions provided on the content page titled Creating Spaces and Adding Secrets in Hugging Face from Week 1

Streamlit for Interactive UI

In [79]:
os.makedirs("/content/deployment_files/Frontend", exist_ok=True)
In [80]:
%%writefile /content/deployment_files/Frontend/app.py
import streamlit as st
import pandas as pd
import joblib
import numpy as np
import requests

# UI Title and Subtitle
st.title("🛒 SuperKart Sales Forecasting App")
st.write("This tool predicts **product-level revenue** in a specific store using historical and categorical inputs.")

# UI for Input Features
st.subheader("Enter Product & Store Details:")

# Categorical Inputs
product_type = st.selectbox("Product Type", [
    "Meat", "Snack Foods", "Soft Drinks", "Dairy", "Household", "Fruits and Vegetables",
    "Frozen Foods", "Breakfast", "Baking Goods", "Health and Hygiene", "Starchy Foods"
])

store_type = st.selectbox("Store Type", [
    "Supermarket Type1", "Supermarket Type2", "Supermarket Type3", "Grocery Store"
])

city_type = st.selectbox("City Type", ["Tier 1", "Tier 2", "Tier 3"])
store_size = st.selectbox("Store Size", ["Small", "Medium", "High"])
sugar_content = st.selectbox("Product Sugar Content", ["No Sugar", "Low Sugar", "Regular"])

# Numerical Inputs
product_weight = st.number_input("Product Weight (kg)", min_value=0.0, max_value=50.0, value=10.0, step=0.1)
product_mrp = st.number_input("Product MRP", min_value=0.0, max_value=1000.0, value=200.0, step=1.0)
allocated_area = st.number_input("Allocated Display Area (0-1)", min_value=0.0, max_value=1.0, value=0.2, step=0.01)
store_est_year = st.number_input("Store Establishment Year", min_value=1950, max_value=2025, value=2010)

# Convert to DataFrame
input_data = pd.DataFrame({
    'Product_Type': [product_type],
    'Store_Type': [store_type],
    'Store_Location_City_Type': [city_type],
    'Store_Size': [store_size],
    'Product_Sugar_Content': [sugar_content],
    'Product_Weight': [product_weight],
    'Product_MRP': [product_mrp],
    'Product_Allocated_Area': [allocated_area],
    'Store_Establishment_Year': [store_est_year],
})

# Make prediction when the "Predict" button is clicked
if st.button("Predict"):
    response = requests.post("https://omoral02-RevenuePredictionBackend.hf.space/v1/predict", json=input_data.to_dict(orient='records')[0])  # Send data to Flask API
    if response.status_code == 200:
        prediction = response.json()['Predicted_Store_Sales_Total']
        st.success(f"Predicted Revenue (in dollars): {prediction}")
    else:
        st.error("Error making prediction.")

# Section for batch prediction
st.subheader("Batch Prediction")

# Allow users to upload a CSV file for batch prediction
uploaded_file = st.file_uploader("Upload CSV file for batch prediction", type=["csv"])

# Make batch prediction when the "Predict Batch" button is clicked
if uploaded_file is not None:
    if st.button("Predict Batch"):
        response = requests.post("https://omoral02-RevenuePredictionBackend.hf.space/v1/batch", files={"file": uploaded_file})  # Send file to Flask API
        if response.status_code == 200:
            predictions = response.json()
            st.success("Batch predictions completed!")
            st.write(predictions)  # Display the predictions
        else:
            st.error("Error making batch prediction.")
Overwriting /content/deployment_files/Frontend/app.py

Dependencies File

In [81]:
%%writefile /content/deployment_files/Frontend/requirements.txt
pandas==2.2.2
numpy==2.0.2
scikit-learn==1.6.1
xgboost==2.1.4
joblib==1.4.2
streamlit==1.43.2
requests==2.28.1
Overwriting /content/deployment_files/Frontend/requirements.txt

DockerFile

In [82]:
%%writefile /content/deployment_files/Frontend/Dockerfile
# Use a minimal base image with Python 3.9 installed
FROM python:3.9-slim

# Set the working directory inside the container to /app
WORKDIR /app

# Copy all files from the current directory on the host to the container's /app directory
COPY . .

# Install Python dependencies listed in requirements.txt
RUN pip3 install -r requirements.txt

# Define the command to run the Streamlit app on port 7860 and make it accessible externally
CMD ["streamlit", "run", "app.py", "--server.port=7860", "--server.address=0.0.0.0", "--server.enableXsrfProtection=false"]

# NOTE: Disable XSRF protection for easier external access in order to make batch predictions
Overwriting /content/deployment_files/Frontend/Dockerfile

Uploading Files to Hugging Face Space (Streamlit Space)

In [83]:
# Login to Hugging Face account using access token
login(token=access_key)

# Try to create the repository for the Hugging Face Space
try:
    create_repo("RevenuePredictionFrontend",  # One can replace "Backend_Docker_space" with the desired space name
        repo_type="space",  # Specify the repository type as "space"
        space_sdk="docker",  # Specify the space SDK as "docker" to create a Docker space
        private=False  # Set to True if you want the space to be private
    )
except Exception as e:
    # Handle potential errors during repository creation
    if "RepositoryAlreadyExistsError" in str(e):
        print("Repository already exists. Skipping creation.")
    else:
        print(f"Error creating repository: {e}")
Error creating repository: 409 Client Error: Conflict for url: https://huggingface.co/api/repos/create (Request ID: Root=1-684681b0-128d1a7766b8d8b405f637a1;993c56c3-4d54-47c7-8421-4e4f7e5a336f)

You already created this space repo
In [84]:
repo_id = "omoral02/RevenuePredictionFrontend"  # Your Hugging Face space id

# Login to Hugging Face platform with the access token
login(token=access_key)

# Initialize the API
api = HfApi()

# Upload Streamlit app files stored in the folder called deployment_files
api.upload_folder(
    folder_path="/content/deployment_files/Frontend",  # Local folder path
    repo_id=repo_id,  # Hugging face space id
    repo_type="space",  # Hugging face repo type "space"
)
No files have been modified since last commit. Skipping to prevent empty commit.
WARNING:huggingface_hub.hf_api:No files have been modified since last commit. Skipping to prevent empty commit.
Out[84]:
CommitInfo(commit_url='https://huggingface.co/spaces/omoral02/RevenuePredictionFrontend/commit/36cda40719b97344b570454a72c7148ee51fc8cc', commit_message='Upload folder using huggingface_hub', commit_description='', oid='36cda40719b97344b570454a72c7148ee51fc8cc', pr_url=None, repo_url=RepoUrl('https://huggingface.co/spaces/omoral02/RevenuePredictionFrontend', endpoint='https://huggingface.co', repo_type='space', repo_id='omoral02/RevenuePredictionFrontend'), pr_revision=None, pr_num=None)

Actionable Insights and Business Recommendations

Sample GET + POST requests

In [85]:
# Lets write a GET try block
try:
  get = requests.get("https://omoral02-RevenuePredictionBackend.hf.space")
  print(get.status_code)
  print(get.headers)
  print(get.json())
except Exception as e:
  print(e)
200
{'Date': 'Mon, 09 Jun 2025 06:39:46 GMT', 'Content-Type': 'application/json', 'Content-Length': '58', 'Connection': 'keep-alive', 'server': 'Werkzeug/2.2.2 Python/3.9.23', 'x-proxied-host': 'http://10.108.71.225', 'x-proxied-replica': 'guop8thg-mc4vx', 'x-proxied-path': '/', 'link': '<https://huggingface.co/spaces/omoral02/RevenuePredictionBackend>;rel="canonical"', 'x-request-id': 'S7IWoE', 'vary': 'origin, access-control-request-method, access-control-request-headers', 'access-control-allow-credentials': 'true'}
{'okay': 'Welcome to the SuperKart Sales Prediction API!'}
In [86]:
sample = {
    'Product_Type': 'Snack Foods',
    'Store_Type': 'Supermarket Type1',
    'Store_Location_City_Type': 'Tier 1',
    'Store_Size': 'Medium',
    'Product_Sugar_Content': 'Low Sugar',
    'Product_Weight': 9.5,
    'Product_MRP': 150.0,
    'Product_Allocated_Area': 0.25,
    'Store_Establishment_Year': 2010
}

try:
  post = requests.post("https://omoral02-RevenuePredictionBackend.hf.space/v1/predict", json=sample)
  print(post.status_code)
  print(post.json())
except Exception as e:
  print(e)
200
{'Predicted_Store_Sales_Total': 2979.93}

Batch Prediction Test

A test CSV file SuperKart_batch.csv was generated with 10 realistic entries based on the existing data schema. Each row includes:

  • Product category and type
  • Store tier and size
  • Sugar content, weight, and MRP
  • Allocated display area
  • Year of store establishment

The batch file allows for evaluating model inference via a POST request to the backend Flask API.

Product_Type Store_Type Store_Location_City_Type Store_Size Product_Sugar_Content Product_Weight Product_MRP Product_Allocated_Area Store_Establishment_Year
Snack Foods Supermarket Type1 Tier 2 Medium Low Sugar 10.2 153 0.07 2009
Health and Hygiene Supermarket Type2 Tier 2 High Regular 13.8 182.6 0.12 2005
Canned Supermarket Type1 Tier 2 High No Sugar 11.4 143.8 0.10 1999
In [87]:
with open("/content/drive/MyDrive/SuperKart_Batch_.csv", "rb") as f:
    response = requests.post(
        "https://omoral02-RevenuePredictionBackend.hf.space/v1/batch",
        files={"file": f}
    )

print("Status Code:", response.status_code)
print("Response:", response.json())
Status Code: 200
Response: {'Predicted_Store_Sales_Total': [3132.66, 4422.17, 3168.39]}

Hugginface Streamlit Frontend UI

UI - Frontend

Huggingface Model

Serialized Model

Huggingface Flask Backend

Backend API

Final Business Insights - Recommendations

Model Selection Justification: xgb_tuned

Performance Tradeoff: While xgb_final (with binning + dummy variables) explored deeper feature engineering, it showed degraded generalization (overfitting and poorer R² on test set).

xgb_tuned delivers balanced performance on both train and test sets without overfitting, supported by robust regularization and tree depth constraints.


Metrics Snapshot:

✅ R² on Test: ~0.93

✅ MAPE: ~4.3%

✅ MAE: ~109

➤ These metrics suggest reliable and stable generalization, suitable for production.


Business Recommendations

Dynamic Inventory Planning Use predicted revenue per product to prioritize stock allocation by product category and store location type.

High Product_MRP with specific Store_Types (e.g., Supermarket Type2 + Tier 2) yield maximum ROI.

Product Portfolio Optimization Categories like Dairy, Snack Foods, Meat, and Starchy Foods showed significant weight in predictions.

Consider deeper segmentation within these for micro-forecasting.

Store Expansion Planning Leverage insights from Store_Size, Store_Establishment_Year, and Store_Location_City_Type:

Older stores in Tier 2 regions consistently show higher normalized revenue per unit MRP.

Prioritize medium-sized stores in Tier 2 cities for expansion.


Merchandising & Promotions

Products with higher Product_Allocated_Area and Product_MRP have nonlinear effects; useful for promotional bundling or pricing.


Deployment Notes

xgb_tuned was serialized and integrated into a Fast API + Streamlit + Hugging Face deployment flow.

Batch and single POST endpoints are functional and input-schema compliant.

Future model refresh cycles can use GridSearchCV within a CI pipeline and SHAP logging for continuous drift monitoring.


SHAP analysis will confirm:

Top Drivers: [Product_Type, Product_MRP, Store_Type, Store_Size, Product_Sugar_Content]

These features consistently rank high in model importance and business relevance.